Skip to content

Fix HIP warp synchronization function conflicts for ROCm 7.0+ #15241

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

slojosic-amd
Copy link
Contributor

@slojosic-amd slojosic-amd commented Aug 11, 2025

Summary

This PR should fix warp synchronization function conflicts that occur when rocWMMA headers are included

Problem

ROCm includes native warp synchronization builtin functions (__shfl_sync, __shfl_xor_sync, etc.) that require 64-bit masks and have been available since ROCm 6.2. However, when rocWMMA headers are included, they pull in these native functions which conflict with the existing compatibility shims that map to the maskless __shfl equivalents.

The conflict occurs because:

  1. Standard HIP headers (like hip_fp16.h) include amd_warp_sync_functions.h
  2. This defines native __shfl_*_sync functions expecting 64-bit masks
  3. Existing code uses 32-bit masks (0xffffffff) causing compilation failures
  4. The native functions override any compatibility macros defined later

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Aug 11, 2025
Copy link
Collaborator

@IMbackK IMbackK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes no sense.
The rocm native __shfl_*_sync functions from amd_warp_sync_functions.h have always used 64bit masks and have always very rightly failed you for passing in a 32bit type.
But we dont use those, we instead #define the __shfl_*_sync functions to the __shfl equivalents, which have no mask.
It makes zero sense why rocwmma would make a difference here other than that the inclusion of its header is somehow causing the shfl functions from amd_warp_sync_functions.h to be used instead.

While switching over to using the __shfl_*_sync from amd_warp_sync_functions.h might look attractive from the surface, it makes little sense for us right now, as they also just call __shfl see https://github.com/ROCm/clr/blob/7f77ceeaf9317261039d8f88b5748cb514a88df9/hipamd/include/hip/amd_detail/amd_warp_sync_functions.h#L298 and where added only in rocm 6.2, while we still need to support at least rocm 6.1 as this is what is in Debian stable.

This pr is going to be a nak from me, please identify how the wrong versions of the shfl functions are being used when built with rocwmma instead.

@slojosic-amd slojosic-amd changed the title Fix HIP warp synchronization mask compatibility for ROCm 7.0+ Fix HIP warp synchronization function conflicts for ROCm 7.0+ Aug 11, 2025
@slojosic-amd
Copy link
Contributor Author

Thanks @IMbackK After deeper investigation, it doesn't have anything related to "new" rocWMMA or some "ROCm 7.0 warp mask breaking changes". The real issue here is macro change in https://github.com/ROCm/clr/blob/rocm-6.4.x-with-7.0-preview/hipamd/include/hip/amd_detail/amd_warp_sync_functions.h#L30
HIP_DISABLE_WARP_SYNC_BUILTINS was HIP_ENABLE_WARP_SYNC_BUILTINS in https://github.com/ROCm/clr/blob/rocm-6.3.x/hipamd/include/hip/amd_detail/amd_warp_sync_functions.h#L30 so for ROCm 6.5+ we should #define HIP_DISABLE_WARP_SYNC_BUILTINS instead of #undef HIP_ENABLE_WARP_SYNC_BUILTINS in https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cuda/fattn-wmma-f16.cu#L18 to prevent conflict with native built in warp sync functions after including rocwmma.hpp Please check: 74c7a2f

Image

@IMbackK
Copy link
Collaborator

IMbackK commented Aug 12, 2025

Yes the change in the behavior in amd_warp_sync_functions.h indeed is the cause of this issue, but this solution is still overly complicated, we dont need to handle separate cases here for different rocm versions. i would suggest this: #15273 as we dont make use of any of the functions this header defines anywhere in ggml

@IMbackK IMbackK closed this Aug 12, 2025
@slojosic-amd
Copy link
Contributor Author

@IMbackK I wanted to have "more obvious" backward compatible changes since HIP_ENABLE_WARP_SYNC_BUILTINS macro was introduced in hip.h file earlier and this macro is relevant for ROCm 6.4 and older but not present in ROCm 7.0. Since you deleted define/undef HIP_ENABLE_WARP_SYNC_BUILTINS and just define HIP_DISABLE_WARP_SYNC_BUILTINS which is present only in ROCm 6.5+ your solution is much cleaner and backward compatible 👍

slojosic-amd added a commit to ROCm/llama.cpp that referenced this pull request Aug 12, 2025
…atibility for ROCm 7.0+ due to requests and comments from ggml-org#15241
@slojosic-amd slojosic-amd deleted the fix/slojosic-amd/warp_sync_rocm_7 branch August 12, 2025 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants